Housing dataset from https://www.kaggle.com/datasets/vikramamin/housing-linear-regression
housing <- read.csv("housing.csv", dec = ".", stringsAsFactors = FALSE)
library(skimr)
skim(housing)
| Name | housing |
| Number of rows | 5000 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Address | 0 | 1 | 21 | 68 | 0 | 5000 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Avg..Area.Income | 0 | 1 | 68583.11 | 10657.99 | 17796.63 | 61480.56 | 68804.29 | 75783.34 | 107701.75 | ▁▁▇▆▁ |
| Avg..Area.House.Age | 0 | 1 | 5.98 | 0.99 | 2.64 | 5.32 | 5.97 | 6.65 | 9.52 | ▁▃▇▃▁ |
| Avg..Area.Number.of.Rooms | 0 | 1 | 6.99 | 1.01 | 3.24 | 6.30 | 7.00 | 7.67 | 10.76 | ▁▃▇▃▁ |
| Avg..Area.Number.of.Bedrooms | 0 | 1 | 3.98 | 1.23 | 2.00 | 3.14 | 4.05 | 4.49 | 6.50 | ▅▇▇▃▃ |
| Area.Population | 0 | 1 | 36163.52 | 9925.65 | 172.61 | 29403.93 | 36199.41 | 42861.29 | 69621.71 | ▁▃▇▅▁ |
| Price | 0 | 1 | 1232072.65 | 353117.63 | 15938.66 | 997577.14 | 1232669.38 | 1471210.20 | 2469065.59 | ▁▃▇▃▁ |
library(summarytools)
dfSummary(housing,plain.ascii = FALSE, style = "grid",tmp.img.dir = "/tmp",graph.magnif = 0.85)
Dimensions: 5000 x 7
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | Avg..Area.Income [numeric] |
Mean (sd) : 68583.1 (10658) min < med < max: 17796.6 < 68804.3 < 107701.7 IQR (CV) : 14302.8 (0.2) |
5000 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 2 | Avg..Area.House.Age [numeric] |
Mean (sd) : 6 (1) min < med < max: 2.6 < 6 < 9.5 IQR (CV) : 1.3 (0.2) |
5000 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 3 | Avg..Area.Number.of.Rooms [numeric] |
Mean (sd) : 7 (1) min < med < max: 3.2 < 7 < 10.8 IQR (CV) : 1.4 (0.1) |
5000 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 4 | Avg..Area.Number.of.Bedrooms [numeric] |
Mean (sd) : 4 (1.2) min < med < max: 2 < 4 < 6.5 IQR (CV) : 1.4 (0.3) |
255 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 5 | Area.Population [numeric] |
Mean (sd) : 36163.5 (9925.7) min < med < max: 172.6 < 36199.4 < 69621.7 IQR (CV) : 13457.4 (0.3) |
5000 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 6 | Price [numeric] |
Mean (sd) : 1232073 (353117.6) min < med < max: 15938.7 < 1232669 < 2469066 IQR (CV) : 473633.1 (0.3) |
5000 distinct values | 5000 (100.0%) |
0 (0.0%) |
|
| 7 | Address [character] |
1. 000 Adkins Crescent South 2. 000 Todd Pines Ashleyberg 3. 001 Steve Plaza Jessicast 4. 0010 Gregory Loaf South E 5. 00149 Raymond Knolls New 6. 002 Katherine Flat Hartma 7. 0022 Young Rest Lake Kevi 8. 0029 Melinda Neck Apt. 59 9. 003 Erica Passage Apt. 27 10. 003 Fernando Gateway Suit [ 4990 others ] |
1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 1 ( 0.0%) 4990 (99.8%) |
5000 (100.0%) |
0 (0.0%) |
sum(is.na(housing))
## [1] 0
sum(!complete.cases(housing))
## [1] 0
apply(is.na(housing), 2, sum)
## Avg..Area.Income Avg..Area.House.Age
## 0 0
## Avg..Area.Number.of.Rooms Avg..Area.Number.of.Bedrooms
## 0 0
## Area.Population Price
## 0 0
## Address
## 0
No empty rows found. The address can be discarded as it does not provide additional information. No factors need to be formed as all values are numeric.
The address is removed from the set because all values are distinct strings.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
housing <-housing %>%
select(c(
"Avg..Area.Income",
"Avg..Area.House.Age",
"Avg..Area.Number.of.Rooms",
"Avg..Area.Number.of.Bedrooms",
"Area.Population",
"Price"
))
The descriptive data analysis consists of presenting:
The target property Y=Price is a continuous variable.
Let’s visualize the distribution using a histogram.
Price_hist <- hist(
housing$Price,
col = "skyblue",
main = "Price frequency",
xlab="Price",
ylab = "Frequency"
)
The relative distribution:
par(mfrow=c(1,2))
Price_hist$counts <- Price_hist$counts / sum(Price_hist$counts) * 100
plot(
Price_hist,
col = "skyblue",
main = "Price % distribution",
xlab="Price",
ylab = "Frequency %",
cex.main=0.8
)
Price_box <- boxplot(
housing$Price,
col = "skyblue",
horizontal = 1,
main = "Price box plot",
xlab = "Price",
cex.main = 0.8,
range = 2
)
Price has outliers on both sides are:
sort(Price_box$out)
## [1] 15938.66 31140.52 2469065.59
The 2 outliers on the left side have the values of 16e3 and 31e3. The 1 outlier on the right has a value of 2.5e6; Details of the 3 outliers:
Price_out <- which(housing$Price %in% Price_box$out)
library(DT)
datatable(
housing[Price_out, ],
options=list(scrollX=1,pageLength=3,searching = FALSE,scroller = TRUE,scrollY=200)
)
Frequency table of Price (first the data has to be cut() into intervals based on the histogram).
intervals <- cut(housing$Price, Price_hist$breaks, include.lowest = TRUE)
col_frequency <- table(intervals)
col_frequency_rel <- prop.table(col_frequency) * 100
Price_ftable <- cbind(rownames(col_frequency), col_frequency, round(col_frequency_rel, 2), round(cumsum(col_frequency_rel), 2))
Price_ftable <- rbind(Price_ftable, c("Total", sum(col_frequency), sum(col_frequency_rel), ""))
Price_ftable <- data.frame(Price_ftable, row.names = NULL)
names(Price_ftable) <- c("Price", "Frequency", "Frequency %", "Cumulative f. %")
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
kable_styling(
kable(
Price_ftable,
align = c('l','r','r', 'r')
),
bootstrap_options = c("striped","hover"),
full_width = 0,
position = "left"
)
| Price | Frequency | Frequency % | Cumulative f. % |
|---|---|---|---|
| [0,2e+05] | 6 | 0.12 | 0.12 |
| (2e+05,4e+05] | 42 | 0.84 | 0.96 |
| (4e+05,6e+05] | 133 | 2.66 | 3.62 |
| (6e+05,8e+05] | 389 | 7.78 | 11.4 |
| (8e+05,1e+06] | 693 | 13.86 | 25.26 |
| (1e+06,1.2e+06] | 1054 | 21.08 | 46.34 |
| (1.2e+06,1.4e+06] | 1087 | 21.74 | 68.08 |
| (1.4e+06,1.6e+06] | 867 | 17.34 | 85.42 |
| (1.6e+06,1.8e+06] | 461 | 9.22 | 94.64 |
| (1.8e+06,2e+06] | 194 | 3.88 | 98.52 |
| (2e+06,2.2e+06] | 59 | 1.18 | 99.7 |
| (2.2e+06,2.4e+06] | 14 | 0.28 | 99.98 |
| (2.4e+06,2.6e+06] | 1 | 0.02 | 100 |
| Total | 5000 | 100 |
Numerical characteristics of the variable Price:
library(FSA)
## ## FSA v0.9.5. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
Price_summary <- Summarize(housing$Price, digits = 2)
kable_styling(kable(t(Price_summary)))
| n | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|
| 5000 | 1232073 | 353117.6 | 15938.66 | 997577.1 | 1232669 | 1471210 | 2469066 |
Price is quite symmetric as indicated by its mean (1232073) marginally exceeding the median (1232669). A majority, 42%, of the Price observations are clustered within a 1e6-1.5e6 range. The dataset of this variable ranges from a minimum of 15e3 to a maxiumum of 2.4e6.
The target property Y=Avg..Area.Income is a continuous variable.
Let’s visualize the distribution using a histogram.
Avg..Area.Income_hist <- hist(
housing$Avg..Area.Income,
col = "skyblue",
main = "Avg income frequency",
xlab="Avg income",
ylab = "Frequency",
breaks = 12
)
The relative distribution:
par(mfrow=c(1,2))
Avg..Area.Income_hist$counts <- Avg..Area.Income_hist$counts / sum(Avg..Area.Income_hist$counts) * 100
plot(
Avg..Area.Income_hist,
col = "skyblue",
main = "Avg income % distribution",
xlab="Avg income",
ylab = "Frequency %",
cex.main=0.8
)
Avg..Area.Income_box <- boxplot(
housing$Avg..Area.Income,
col = "skyblue",
horizontal = 1,
main = "Avg income box plot",
xlab = "Avg income",
cex.main = 0.8,
range = 1.9
)
Avg..Area.Income_hist has an asymmetric left side. The outliers are:
sort(Avg..Area.Income_box$out)
## [1] 17796.63 104702.72 107701.75
The outlier on the left side has a value of ~17769 and the values of the 2 outliers on right are 104e3 and 107e3. Details of the 3 outliers:
Avg..Area.Income_out <- which(housing$Avg..Area.Income %in% Avg..Area.Income_box$out)
library(DT)
datatable(
housing[Avg..Area.Income_out, ],
options=list(scrollX=1,pageLength=3,searching = FALSE,scroller = TRUE,scrollY=200)
)
Frequency table of Avg..Area.Income.
intervals <- cut(housing$Avg..Area.Income, Avg..Area.Income_hist$breaks, include.lowest = TRUE)
col_frequency <- table(intervals)
col_frequency_rel <- prop.table(col_frequency) * 100
Avg..Area.Income_ftable <- cbind(rownames(col_frequency), col_frequency, round(col_frequency_rel, 2), round(cumsum(col_frequency_rel), 2))
Avg..Area.Income_ftable <- rbind(Avg..Area.Income_ftable, c("Total", sum(col_frequency), sum(col_frequency_rel), ""))
Avg..Area.Income_ftable <- data.frame(Avg..Area.Income_ftable, row.names = NULL)
names(Avg..Area.Income_ftable) <- c("Avg..Area.Income", "Frequency", "Frequency %", "Cumulative f. %")
library(knitr)
library(kableExtra)
kable_styling(
kable(
Avg..Area.Income_ftable,
align = c('l','r','r', 'r')
),
bootstrap_options = c("striped","hover"),
full_width = 0,
position = "left"
)
| Avg..Area.Income | Frequency | Frequency % | Cumulative f. % |
|---|---|---|---|
| [1e+04,2e+04] | 1 | 0.02 | 0.02 |
| (2e+04,3e+04] | 0 | 0 | 0.02 |
| (3e+04,4e+04] | 18 | 0.36 | 0.38 |
| (4e+04,5e+04] | 197 | 3.94 | 4.32 |
| (5e+04,6e+04] | 822 | 16.44 | 20.76 |
| (6e+04,7e+04] | 1706 | 34.12 | 54.88 |
| (7e+04,8e+04] | 1566 | 31.32 | 86.2 |
| (8e+04,9e+04] | 588 | 11.76 | 97.96 |
| (9e+04,1e+05] | 95 | 1.9 | 99.86 |
| (1e+05,1.1e+05] | 7 | 0.14 | 100 |
| Total | 5000 | 100 |
Numerical characteristics of the variable Avg..Area.Income:
library(FSA)
Avg..Area.Income_summary <- Summarize(housing$Avg..Area.Income, digits = 2)
kable_styling(kable(t(Avg..Area.Income_summary)))
| n | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|
| 5000 | 68583.11 | 10657.99 | 17796.63 | 61480.56 | 68804.29 | 75783.34 | 107701.8 |
Avg..Area.Income is quite symmetric, with a slight left skew, as indicated by its mean (68583) marginally deceeding the median (68804). A majority, 65%, of the Avg..Area.Income observations are clustered within a 60k to 80k range. The dataset of this variable ranges from a minimum of 17e3 to a maxiumum of 107e3.
The target property Y=Avg..Area.House.Age is a continuous variable.
Let’s visualize the distribution using a histogram.
Avg..Area.House.Age_hist <- hist(
housing$Avg..Area.House.Age,
col = "skyblue",
main = "Avg house age frequency",
xlab="Avg house age",
ylab = "Frequency"
)
The relative distribution:
par(mfrow=c(1,2))
Avg..Area.House.Age_hist$counts <- Avg..Area.House.Age_hist$counts / sum(Avg..Area.House.Age_hist$counts) * 100
plot(
Avg..Area.House.Age_hist,
col = "skyblue",
main = "Avg house age % distribution",
xlab="Avg house age",
ylab = "Frequency %",
cex.main=0.8
)
Avg..Area.House.Age_box <- boxplot(
housing$Avg..Area.House.Age,
col = "skyblue",
horizontal = 1,
main = "Avg house age box plot",
xlab = "Avg house age",
cex.main = 0.8,
range = 1.6
)
Avg..Area.House.Age has a slight asymmetric right side. The outliers are:
sort(Avg..Area.House.Age_box$out)
## [1] 2.644304 2.683043 2.797215 2.797619 2.922736 3.004717 3.055370 3.105751
## [9] 3.118802 3.144894 8.899713 8.916093 8.973441 8.991399 9.008900 9.125283
## [17] 9.519088
The outliers are in the ranges of ~2.64 to ~3.14 (in total 10) and ~8.92 to ~9.51 (in total 7). Details of the 17 outliers:
Avg..Area.House.Age_out <- which(housing$Avg..Area.House.Age %in% Avg..Area.House.Age_box$out)
library(DT)
datatable(
housing[Avg..Area.House.Age_out, ],
options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)
The target property Y=Avg..Area.Number.of.Rooms is a continuous variable.
Let’s visualize the distribution using a histogram.
Avg..Area.Number.of.Rooms_hist <- hist(
housing$Avg..Area.Number.of.Rooms,
col = "skyblue",
main = "Avg rooms frequency",
xlab="Avg rooms",
ylab = "Frequency"
)
The relative distribution:
par(mfrow=c(1,2))
Avg..Area.Number.of.Rooms_hist$counts <- Avg..Area.Number.of.Rooms_hist$counts / sum(Avg..Area.Number.of.Rooms_hist$counts) * 100
plot(
Avg..Area.Number.of.Rooms_hist,
col = "skyblue",
main = "Avg rooms % distribution",
xlab="Avg rooms",
ylab = "Frequency %",
cex.main=0.8
)
Avg..Area.Number.of.Rooms_box <- boxplot(
housing$Avg..Area.Number.of.Rooms,
col = "skyblue",
horizontal = 1,
main = "Avg rooms box plot",
xlab = "Avg rooms",
cex.main = 0.8,
range = 1.65
)
Avg..Area.Number.of.Rooms has a slight asymmetric left side. The outliers are:
sort(Avg..Area.Number.of.Rooms_box$out)
## [1] 3.236194 3.950225 3.950973 3.969632 4.027931 9.926147 10.024375
## [8] 10.144988 10.219902 10.280022 10.759588
The outliers are in the ranges of ~3.23 to ~4.03 and ~9.93 to ~10.6. Details of the outliers:
Avg..Area.Number.of.Rooms_out <- which(housing$Avg..Area.Number.of.Rooms %in% Avg..Area.Number.of.Rooms_box$out)
library(DT)
datatable(
housing[Avg..Area.House.Age_out, ],
options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)
The target property Y=Avg..Area.Number.of.Bedrooms is a continuous variable.
Let’s visualize the distribution using a histogram.
Avg..Area.Number.of.Bedrooms_hist <- hist(
housing$Avg..Area.Number.of.Bedrooms,
col = "skyblue",
main = "Avg bedrooms frequency",
xlab="Avg bedrooms",
ylab = "Frequency",
breaks = c(2, 3, 4, 5, 6, 7)
)
The relative distribution:
par(mfrow=c(1,2))
Avg..Area.Number.of.Bedrooms_hist$counts <- Avg..Area.Number.of.Bedrooms_hist$counts / sum(Avg..Area.Number.of.Bedrooms_hist$counts) * 100
plot(
Avg..Area.Number.of.Bedrooms_hist,
col = "skyblue",
main = "Avg bedrooms % distribution",
xlab="Avg bedrooms",
ylab = "Frequency %",
cex.main=0.8
)
Avg..Area.Number.of.Bedrooms_box <- boxplot(
housing$Avg..Area.Number.of.Bedrooms,
col = "skyblue",
horizontal = 1,
main = "Avg bedrooms box plot",
xlab = "Avg bedrooms",
cex.main = 0.8,
range = 1.5
)
Avg..Area.Number.of.Bedrooms has a slight asymmetric right side. No outliers were detected:
sort(Avg..Area.Number.of.Bedrooms_box$out)
## numeric(0)
The target property Y=Area.Population is a continuous variable.
Let’s visualize the distribution using a histogram.
Area.Population_hist <- hist(
housing$Area.Population,
col = "skyblue",
main = "Area population frequency",
xlab="Population",
ylab = "Frequency"
)
The relative distribution:
par(mfrow=c(1,2))
Area.Population_hist$counts <- Area.Population_hist$counts / sum(Area.Population_hist$counts) * 100
plot(
Area.Population_hist,
col = "skyblue",
main = "Area population % distribution",
xlab="Population",
ylab = "Frequency %",
cex.main=0.8
)
Area.Population_box <- boxplot(
housing$Area.Population,
col = "skyblue",
horizontal = 1,
main = "Area population box plot",
xlab = "Population rooms",
cex.main = 0.8,
range = 1.5
)
Area.Population has a slight asymmetric left side. The outliers are:
sort(Area.Population_box$out)
## [1] 172.6107 3285.4505 3883.4482 4114.4894 5727.4859 6248.7561
## [7] 6805.7408 6821.9502 7234.9635 7360.2952 7522.3331 9193.8332
## [13] 63184.6131 63620.0120 64149.6802 64180.3708 64543.3224 64566.6874
## [19] 65184.5785 65857.9333 66995.4740 67353.9652 67601.2236 67701.6498
## [25] 67727.2291 68311.6958 69553.9883 69575.4495 69592.0402 69621.7134
The outliers are in the ranges of ~170 to ~9200 and ~63000 to ~7000. Details of the outliers:
Area.Population_out <- which(housing$Area.Population %in% Area.Population_box$out)
library(DT)
datatable(
housing[Area.Population_out, ],
options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)
In total 93 outliers were found.
all_out <- c(Price_out, Avg..Area.Income_out, Avg..Area.Number.of.Rooms_out, Avg..Area.House.Age_out, Area.Population_out)
library(DT)
datatable(
housing[all_out, ],
options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)
# Total of 90 outliers were removed from the _housing_ dataset.
# housing <- housing[-all_out, ]
The target variable for analysis is Price. Visualization of all the numeric variable distributions with a plot().
plot(Filter(is.numeric, housing), pch=20, col='blue', cex=.5)
Looks like The price grows linearly to all variables except Avg..Area.Number.Of.Bedrooms, but everything is quite scattered. The independant variables data is scattered, but is still linearly growing. The most promising/clean linear incline to Price happens with Avg..Area.Income.
Looking at the relationship between Avg..Area.Income and Price using a ggplot.
library(ggplot2)
housing %>%
ggplot(aes(x = Price, y = Avg..Area.Income)) +
geom_point() +
xlab("Price ($)") +
ylab("Avg Area Income ($/year)")
We can see that the Price increases when Avg..Area.Income is increasing. Let’s add a color to Area.Population.
library(ggplot2)
housing %>%
ggplot(aes(x = Price, y = Avg..Area.Income, color = Area.Population)) +
geom_point() +
xlab("Price ($)") +
ylab("Avg Area Income ($/year)")
housing %>%
ggplot(aes(y = Price, x = Avg..Area.Income, color = Area.Population)) +
geom_point() +
ylab("Price ($)") +
xlab("Avg Area Income ($/year)")
We can see that the price increases when Avg..Area.Income is
increasing. In lower population areas the price is lower.
Plots of filtered data:
library(dplyr)
# TODO: find all permutations of independent vars to find homoscedastic relations
# intervals for every independant variable
# x loop through each independant variable and each interval
# y loop through anothe independant variable
plot(Filter(is.numeric, housing) %>%
filter(between(Avg..Area.Number.of.Rooms, 3, 5)) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="3-5 rooms")
plot(Filter(is.numeric, housing) %>%
filter(between(Avg..Area.Number.of.Rooms, 5, 6)) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="5-6 rooms")
plot(Filter(is.numeric, housing) %>%
filter(between(Avg..Area.Number.of.Rooms, 6, 7)) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="6-7 rooms")
plot(Filter(is.numeric, housing) %>%
filter(between(Avg..Area.Number.of.Rooms, 7, 8)) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="7-8 rooms")
plot(Filter(is.numeric, housing) %>%
filter(round(Avg..Area.Number.of.Rooms) == 7) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="7 rooms")
plot(Filter(is.numeric, housing) %>%
filter(between(Avg..Area.Number.of.Rooms, 9, 11)) %>%
select(c(
"Price",
"Avg..Area.Income",
"Avg..Area.House.Age",
"Area.Population"
)), pch=20, col='blue', cex=.5, main="9-11 rooms")
See on histograms, how the data is distributed.
price_hist <- hist(
housing$Price,
col = "skyblue",
main = "Price frequency",
xlab="$",
ylab = "Count"
)
par(mfrow=c(1,2))
price_hist$counts <- price_hist$counts / sum(price_hist$counts) * 100
price_hist$counts
## [1] 0.12 0.84 2.66 7.78 13.86 21.08 21.74 17.34 9.22 3.88 1.18 0.28
## [13] 0.02
plot(
price_hist,
col = "skyblue",
main = "Price % distribution",
xlab="$",
ylab = "Frequency %",
cex.main=0.8
)
price_box <- boxplot(
housing$Price,
col = "skyblue",
horizontal = 1,
main = "Price box plot",
xlab = "$",
range = 1.5,
cex.main = 0.8
)
price_box_out <- which(housing$Price %in% price_box$out)
price_box_out
## [1] 91 257 264 356 466 623 694 697 716 902 925 991 1209 1249 1272
## [16] 1357 1460 1486 1517 1537 1579 1662 1800 2301 2539 2720 2757 2796 3092 3213
## [31] 3503 3923 4130 4401 4452
library(DT)
datatable(
housing[price_box_out, ],
options=list(scrollX=1,pageLength=10,searching = FALSE,scroller = TRUE,scrollY=200)
)
avg_age_hist <- hist(
housing$Avg..Area.House.Age,
col = "skyblue",
main = "House age frequency",
xlab="Years",
ylab = "Count"
)
Also the relative distribution
par(mfrow=c(1,2))
avg_age_hist$counts <- avg_age_hist$counts / sum(avg_age_hist$counts) * 100
avg_age_hist$counts
## [1] 0.10 0.52 1.62 4.84 9.08 15.50 19.48 19.00 14.36 9.00 4.32 1.72
## [13] 0.40 0.04 0.02
plot(
avg_age_hist,
col = "skyblue",
main = "House age % distribution",
xlab="Years",
ylab = "Frequency %",
cex.main=0.8
)
avg_rooms_box <- boxplot(
housing$Avg..Area.House.Age,
col = "skyblue",
horizontal = 1,
main = "House age box plot",
xlab = "Years",
range = 2,
cex.main = 0.8
)
boxplot
## function (x, ...)
## UseMethod("boxplot")
## <bytecode: 0x55c317bb64c8>
## <environment: namespace:graphics>
Outliers of the Avg..Area.House.Age.
sort(avg_rooms_box$out)
## [1] 2.644304 9.519088
avg_rooms_box_out <- which(housing$Avg..Area.House.Age %in% avg_rooms_box$out)
avg_rooms_box_out
## [1] 1075 3990
library(DT)
datatable(
housing[avg_rooms_box_out, ],
options=list(scrollX=1,pageLenght=5,searching = FALSE,scroller = TRUE,scrollY=200)
)
Lets compare the Price with the discrete variables using box plots. The number of rooms and the age of the house are discrete if rounded.
par(mfrow=c(1,2))
discrete_vars = c(
"Avg..Area.House.Age",
"Avg..Area.Number.of.Rooms",
"Avg..Area.Number.of.Bedrooms",
"Area.Population",
"Avg..Area.Income"
)
for (i in 1:length(discrete_vars)) {
boxplot(
housing$Price~round(housing[,discrete_vars[i]]),
col="skyblue",
xlab ="",
ylab ="",
las=2,
cex.axis=0.5,
cex.lab=0.7,
main=paste("Price vs" ,discrete_vars[i]),
range=2
)
}
Let’s get the summary for all of our variables with Price being our dependent variable and the other variables being independent.
model <- lm(Price ~ Avg..Area.Income + Avg..Area.House.Age + Avg..Area.Number.of.Rooms + Avg..Area.Number.of.Bedrooms + Area.Population, data = housing)
summary(model)
##
## Call:
## lm(formula = Price ~ Avg..Area.Income + Avg..Area.House.Age +
## Avg..Area.Number.of.Rooms + Avg..Area.Number.of.Bedrooms +
## Area.Population, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -337020 -70236 320 69175 361870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.637e+06 1.716e+04 -153.708 <2e-16 ***
## Avg..Area.Income 2.158e+01 1.343e-01 160.656 <2e-16 ***
## Avg..Area.House.Age 1.656e+05 1.443e+03 114.754 <2e-16 ***
## Avg..Area.Number.of.Rooms 1.207e+05 1.605e+03 75.170 <2e-16 ***
## Avg..Area.Number.of.Bedrooms 1.651e+03 1.309e+03 1.262 0.207
## Area.Population 1.520e+01 1.442e-01 105.393 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101200 on 4994 degrees of freedom
## Multiple R-squared: 0.918, Adjusted R-squared: 0.9179
## F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 2.2e-16
We can see that “Average Area Income” coefficient does indeed have the highest effect on price, followed by “Average Area House Age” and “Area Population”.
Let’s show graphs for all variables:
plot(housing$Avg..Area.Income, housing$Price, main="House Price vs Income", xlab="Income", ylab="House Price")
model_HouseIncome <- lm(Price ~ Avg..Area.Income, data = housing)
abline(model_HouseIncome, col="red")
plot(housing$Avg..Area.House.Age, housing$Price, main="House Price vs House Age", xlab="House Age", ylab="House Price")
model_HouseAge <- lm(Price ~ Avg..Area.House.Age, data = housing)
abline(model_HouseAge, col="red")
plot(housing$Avg..Area.Number.of.Rooms, housing$Price, main="House Price vs Rooms", xlab="Rooms", ylab="House Price")
model_HouseRooms <- lm(Price ~ Avg..Area.Number.of.Rooms, data = housing)
abline(model_HouseRooms, col="red")
plot(housing$Avg..Area.Number.of.Bedrooms, housing$Price, main="House Price vs Bedrooms", xlab="Bedrooms", ylab="House Price")
model_HouseBedrooms <- lm(Price ~ Avg..Area.Number.of.Bedrooms, data = housing)
abline(model_HouseBedrooms, col="red")
plot(housing$Area.Population, housing$Price, main="House Price vs Population", xlab="Population", ylab="House Price")
model_Population <- lm(Price ~ Area.Population, data = housing)
abline(model_Population, col="red")
We can see from the graphs that “Average Income” has the most significant increase in price with “Average House Age” being the second most impactful.
There is a slight increase in price with the other 3 variables but it is minimal.
Data shows that the more money people earn the more they’re willing to spend on housing.
There are outliers but on average a person spends roughly:
housing$Percentage <- (housing$Price / housing$Avg..Area.Income) * 100
average_percentage <- mean(housing$Percentage)
average_percentage
## [1] 1788.444
-Percent of their yearly income on housing.
This means that in order to buy the houses they need to save up for 17.88 years of their salary or get a house loan with a duration of 20 years or longer.